171 research outputs found

    Text Classification Using Association Rules, Dependency Pruning and Hyperonymization

    We present new methods for pruning and enhancing itemsets for text classification via association rule mining. The pruning methods are based on dependency syntax, and the enhancing methods are based on replacing words by their hyperonyms of various orders. We discuss the impact of these methods compared to pruning based on the tfidf rank of words. Comment: 16 pages, 2 figures, presented at DMNLP 201

    Thematically Reinforced Explicit Semantic Analysis

    We present an extended, thematically reinforced version of Gabrilovich and Markovitch's Explicit Semantic Analysis (ESA), where we obtain thematic information through the category structure of Wikipedia. For this we first define a notion of categorical tfidf which measures the relevance of terms in categories. Using this measure as a weight, we calculate a maximal spanning tree of the Wikipedia corpus considered as a directed graph of pages and categories. This tree provides us with a unique path of "most related categories" between each page and the top of the hierarchy. We reinforce the tfidf of words in a page by aggregating it with the categorical tfidfs of the nodes of these paths, and define a thematically reinforced ESA semantic relatedness measure which is more robust than standard ESA and less sensitive to noise caused by out-of-context words. We apply our method to the French Wikipedia corpus, evaluate it through text classification on a 37.5 MB corpus of 20 French newsgroups, and obtain a precision increase of 9-10% compared with standard ESA. Comment: 13 pages, 2 figures, presented at CICLing 201

    Indica, an Indic preprocessor for TeX. A Sinhalese TeX System

    This paper describes a two-fold project. The first part is a generalized preprocessor for Indic scripts (scripts of languages currently spoken in India, except Urdu, together with Sanskrit and Tibetan), with several kinds of input (LaTeX commands, 7-bit ASCII, CSX, Unicode) and TeX output. This utility is written in standard Flex (the GNU version of Lex), and hence can be painlessly compiled on any platform. The same input methods are used for all Indic languages, so that the user does not need to memorize different conventions and commands for each one of them. Moreover, the switch from one language to another can be made with user-definable preprocessor directives. The second part is a complete TeX typesetting system for Sinhalese. The design of the fonts is described, and METAFONT-related features, such as metaness and optical correction, are discussed. At the end of the paper, the reader can find tables showing the different input methods for the four Indic scripts currently implemented in Indica: Devanagari, Tamil, Malayalam, Sinhalese.

    Les mathématiques de la langue : l'approche formelle de Montague

    We present a method for modeling natural language that relies strongly on mathematics. This method, called "Formal Semantics," was initiated by the American linguist Richard M. Montague in the 1970s. It uses mathematical tools such as formal languages and grammars, first-order logic, type theory and λ-calculus. Our goal is to have the reader discover both Montagovian formal semantics and the mathematical tools that he used in his method. Comment: 14 pages, in French. Will appear in the journal Quadrature (http://www.quadrature.info) in 201

    Unicode, XML, TEI, Ω and Scholarly Documents


    The Khmer Script Tamed by the Lion (of TeX)

    This paper presents a Khmer typesetting system, based on TeX, METAFONT, and an ANSI-C filter. A 128-character portion of the 8-bit ASCII table is proposed for the Khmer script. Text is input phonically (using the spoken order consonant, subscript consonant, second subscript consonant, vowel, diacritic). The filter converts the phonic description of consonantal clusters into a graphic TeXnical description of them. Thanks to TeX booleans, independent vowels can be automatically decomposed according to recent reforms of Khmer spelling. The last section presents a forthcoming implementation of Khmer in a 16-bit TeX output font, solving the kerning problem of consonantal clusters.

    On TeX and Greek…


    Virtual Fonts: Great Fun, Not for Wizards Only


    The Traditional Arabic Typecase, Unicode, TeX and METAFONT
